ABSTRACT

An attempt is made to explore the use of subjective
probabilities in the analysis of item data, especially
criterion-referenced item data. Two assumptions are implicit: (1) one
wants to obtain a maximum amount of information with respect to an
item using a minimum number of subjects; and (2) once an item is
validated, it may well be administered in the classical
correct/incorrect manner. One way to satisfy these assumptions is to
initially administer the unvalidated item to a small number of
subjects using confidence testing procedures. Then subjective
probabilities can be translated to pseudo-classical item scores which
are, at least in theory, guessing-free. Using pseudo-classical
scores, a relatively sophisticated item analysis table can be
constructed, typical item statistics can be calculated, and a kind of
item reliability, independent of total test reliability, can be
assessed. As a useful bridge between subjective probabilities and
classical correct/incorrect scores, pseudo-classical scores appear to
be of potential use in the analysis of criterion-referenced items.
(Author/BC)
For norm-referenced testing, classical correct/incorrect
administration and scoring procedures seem to be reasonably
effective and useful. However, norm-referenced tests are
usually relatively long; the scores from such tests are
often normally distributed; floor and ceiling effects seldom
occur in norm-referenced tests; and, most importantly, one
is not very much concerned about the precise proportion of
items a student can answer correctly; rather, one is
concerned about the ability of the test to distinguish among
subjects. Each of these characteristics of a norm-referenced
test argues directly or indirectly that the classical
correct/incorrect procedure is reasonably adequate (or, at
least, not grossly inadequate) for many norm-referenced tests.

On the other hand, criterion-referenced tests are
usually short; the scores from such tests are often
negatively skewed, even severely so; ceiling effects
are very common; and, most importantly, one is fundamentally
concerned about accurately estimating the proportion of
items to which a student knows the answer (or possibly some
other score). This emphasis on accurate estimation of a
student's score is especially critical in criterion-referenced
testing because there is seldom any external criterion
measure for judging validity.

Thus, in criterion-referenced testing it is very
important to use every possible means of eliminating
random (and systematic) errors of measurement. In
particular, it seems to this author that it is important
to eliminate (or, at least, be able to estimate the effect
of) guessing. Now, it is very clear that a considerable
amount of student guessing frequently occurs when a
student is forced to pick one and only one alternative
and the classical correct/incorrect scoring procedure is
used; moreover, when the classical procedure is used, it
is very difficult, if not impossible, to ascertain the
magnitude of the effect of guessing upon student scores.

Footnote: In this paper "decision-theoretic testing,"
"confidence testing," and "admissible probability
measurement" are all synonymous. The reader is
referred to Brennan (1974) for a more complete version
of this paper.

Furthermore, since criterion-referenced tests are
frequently short, it seems desirable to obtain as much
information as possible from each item; yet, using the
classical procedure for administering and scoring an
item, one merely knows whether or not the student got
the item correct. In particular, using the classical
procedure one does not obtain information with regard to
the relative attractiveness of each alternative for each
student. This kind of information can be very useful
in determining whether or not to revise a criterion-referenced
test item. Thus, the classical procedure somewhat
limits the amount of information we obtain with
regard to any given criterion-referenced test item.

In short, from a criterion-referenced testing viewpoint,
this author feels that the classical procedure for
administering and scoring an item has two serious
limitations: (a) scores obtained using this procedure
incorporate an indeterminable amount of guessing and (b) this
procedure provides very little information with regard to
any given item, especially when relatively small numbers
of students take the item. These points imply that when
we use the classical procedure for criterion-referenced
testing, we may have less than adequate information for
determining whether or not a criterion-referenced test
item requires revision.

Therefore, it is worthwhile to consider alternatives
to the classical procedure. There are a number of points
of view from which one could consider different procedures.
Here we are interested in the ability of the procedure
to aid us in item analysis. That is, our goal is to
identify a procedure for administering an item that
provides us with optimum data for determining whether or
not the item needs to be revised; and, if possible, these
data should aid us in pinpointing the nature of any
difficulties with the item. For this purpose, we consider
two potential procedures which we call the "elimination
procedure" and the "confidence procedure." We find that
the confidence procedure is the better of the two for
our purposes.



It should be noted that here we are not concerned
about the kinds of scores typically obtained from the
elimination and confidence procedures; rather, our
primary concern is with the nature and amount of data
collected when such procedures are used. Also, we do
not assume that once an item is administered using one
procedure it will always be administered using that
procedure. In fact, when we consider the confidence
procedure, the manner in which we interpret the data
provides us with a kind of guessing-free estimate of
a person's classical score. Thus, once an item has
been validated using the confidence procedure, one can
administer the item using the classical procedure.

Two Alternatives to the Classical Procedure for
Administering Items

Elimination procedure. Coombs et al. (1956) suggest
a procedure for administering and scoring a test based
upon having students eliminate alternatives that they
consider to be incorrect. Since a student may eliminate
any number of alternatives for any test item, the
elimination procedure provides some information about
the relative attractiveness of each alternative.
However, the information provided is somewhat ambiguous
in that, for example, if a student eliminates two
alternatives, we do not know whether or not the student feels
more uncertain about one alternative than the other.

Also, let us consider the elimination procedure from
another point of view. As indicated previously, we are
interested in a procedure's ability to provide us with
a kind of guessing-free estimate of a person's classical
score. Let us call such an estimate a PC1 score,
indicating the probability (P) that a person's classical
(C) score on an item is unity (1). If we know, for
example, that a person guessed randomly on a four-alternative
item, then PC1 should be 0.25. The question
is, "Can the kind of data collected using the elimination
procedure provide us with an adequate basis for estimating
a student's PC1 score for an item?"

Suppose, for example, that a student eliminates
two alternatives for a four-alternative item. If we
could assume that, when forced to pick one and only one
alternative, the student would randomly pick one of the
two non-eliminated alternatives, then the PC1 score for
the student for the item would be 0.50. However, this
assumption is not necessarily valid; in fact, one could
argue that PC1 might be any value between 0.50 and 1.00.



Thus, it does not appear that the elimination procedure
provides an adequate basis for estimating a student's
PC1 score for an item. Consequently, if the student
were administered the item a large number of times,
we don't have a very good basis for estimating the number,
or proportion, of times the student would get the item
correct under the classical scoring procedure. If the
item is administered K times, this number should be
K · PC1 (that is, the proportion should be PC1).

Confidence procedure. In confidence testing, one
obtains from each student a subjective probability that
each alternative of a test item is correct. There are
a number of techniques that can be used to obtain these
probabilities either directly or indirectly. This author
prefers the technique usually called the "star" method
in which a student is told to distribute a fixed number
of "stars" or points over the alternatives of a test
item. For example, students might be told to distribute
twelve points over the alternatives of a four-alternative
item. The table below indicates some of the ways students
might perform this task and the associated (subjective)
probabilities.

     No. of Points             Probabilities
  A*    B     C     D       A      B      C      D       PC1

   3    3     3     3      .25    .25    .25    .25      .25
   4    4     4     0      .33    .33    .33    .00      .33
   6    6     0     0      .50    .50    .00    .00      .50
  12    0     0     0     1.00    .00    .00    .00     1.00
   5    5     1     1      .42    .42    .08    .08      .50
   5    2     4     1      .42    .17    .33    .08     1.00

Footnote: The reader interested in a more in-depth discussion
of confidence testing can consult de Finetti (1965),
Echternacht (1972), Savage (1971), and Shuford et al. (1966).
A great deal of the literature on confidence testing
involves discussion of various procedures for scoring such
tests, but this is not our concern in this chapter.

Footnote: Appendix A to Brennan (1974) is a manual for DEC-TEST,
a computer program that analyzes confidence test data
in great detail. Further, the introduction to this manual
provides a description of confidence testing and elimination
testing as these procedures are typically used.



Here we are concerned about the nature of the data 
(i.e., the probabilities) collected for each item and 
for each student. 

Each probability indicates how confident the
student is that the particular alternative is the correct
answer for the item. Using these probabilities we can
obtain PC1 scores from the following rules:

Let M = the magnitude of the highest probability
        for a particular student for a given item,

    A = the number of alternatives for the item,

    P(a) = the probability associated with alternative
           a (a = 1, 2, ..., A), and

    * = the correct alternative.

Now,

    PC1 = 0 if P(*) < M;

    PC1 = 1/K if P(*) = M and there are (K - 1) other
          alternatives having P(a) = M; and

    PC1 = 1 if P(*) = M and there are no other
          alternatives having P(a) = M.

See the table above for examples of PC1
scores. Note, in particular, that the third and fifth
students both have PC1 = 0.50 even though M = 0.50 for
the third student and M = 0.42 for the fifth student.
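The scoring rules above can be sketched in a few lines of Python. This is only an illustration, not the DEC-TEST program mentioned in the text: the function name `pc1_score` and the use of raw "star" points (rather than probabilities) as input are assumptions made here so that ties can be detected exactly.

```python
from fractions import Fraction


def pc1_score(points, correct_index):
    """Return the PC1 score for one student on one item.

    points        -- points the student assigned to each alternative
                     (hypothetical input format; e.g. twelve "stars")
    correct_index -- position of the correct alternative in `points`
    """
    highest = max(points)                      # M, up to a constant factor
    if points[correct_index] < highest:
        return Fraction(0)                     # correct answer is not a top choice
    tied = sum(1 for p in points if p == highest)
    return Fraction(1, tied)                   # 1/K when K alternatives share M


if __name__ == "__main__":
    # The six response patterns from the table; alternative A (index 0) is correct.
    patterns = [
        [3, 3, 3, 3],
        [4, 4, 4, 0],
        [6, 6, 0, 0],
        [12, 0, 0, 0],
        [5, 5, 1, 1],
        [5, 2, 4, 1],
    ]
    for pattern in patterns:
        print(pattern, float(pc1_score(pattern, 0)))
```

Working with integer points avoids floating-point comparisons when deciding whether two alternatives share the highest probability.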

Thus, PC1 scores are readily available from the
subjective probabilities one obtains using the confidence
testing procedure. Furthermore, when one uses
confidence testing as a procedure to collect data for
items, one obtains, for each student, a probability
associated with each alternative for each item. Thus, one
has a great deal of information for each item, much
more information than if students pick one alternative
or eliminate alternatives.

In short, the confidence procedure seems to be
superior to the elimination procedure, at least for our
purposes here.
Item Analysis Tables from the Confidence Procedure

Consider the synthetic data for a hypothetical
item presented in Table 1. The item has four alternatives,
"A" is the correct answer, and the twenty students are
partitioned into lower and upper groups of ten students
each. The confidence probabilities are indicated for
each alternative and for each student. We emphasize that
these are synthetic data, and they are not necessarily
indicative of a good criterion-referenced test item;
we use these data merely to illustrate our discussion.

For each confidence probability in Table 1, there
is a pseudo-classical score. A pseudo-classical score
for an alternative is defined as the probability that a
student would pick the alternative if the student were
forced to choose one and only one alternative for the
item under consideration. Thus, the pseudo-classical
score for an item is the pseudo-classical score for the
correct alternative; also, the pseudo-classical score for
an item is identical to the PC1 score discussed previously.

Using the data in Table 1, one can construct the
item analysis tables given by Tables 2 and 3, where
Table 2 uses confidence probabilities and Table 3 uses
pseudo-classical scores. Both tables present frequency
distributions of scores on alternatives, with associated
totals, means, and standard deviations. Clearly, Table
2 provides more information, and a somewhat different
kind of information, than Table 3; and both tables
provide much more information than is available from item
analysis tables based upon the classical correct/incorrect
scoring procedure. This additional information can be
quite useful in deciding what (if anything) is wrong with
a criterion-referenced test item.

Now, let us summarize a few points implicit in our
discussion thus far. We are assuming that once an item
is validated it probably will be administered using the
classical correct/incorrect scoring procedure. However,
in order to validate the item we are suggesting that the
evaluator collect confidence probabilities for each
alternative, translate these probabilities to pseudo-classical
scores for each alternative, and generate the
pseudo-classical item analysis table. This table indicates
the probability that each student would pick each alternative
using the classical correct/incorrect scoring
procedure; thus, using this table one can analyze the
probable effect of guessing upon the performance of other
similar students who take the item using the classical
procedure for item administration and scoring. Further,
if one wants a detailed display of the certainty with
which students choose any alternative, one can generate
the item analysis table based upon the confidence
probabilities.

TABLE 1
Synthetic Data

           Confidence Probabilities        Pseudo-classical Scores
Student      A      B      C      D          A      B      C      D

[Rows for the individual students 1-20 are not legible in the source.]

Sum-L      3.60   3.80   1.00   1.60       3.83   3.83   1.00   1.33
SD-L        .25    .25  [remaining entries illegible]

Sum-U      6.05   2.35    .85    .75       6.58   2.58    .58    .25
Mean-U      .61    .24    .09    .08        .66    .26    .06    .03
SD-U        .25    .20    .11    .09        .37    .32    .12    .08

Sum-T      9.65   6.15   1.85   2.35      10.41   6.41   1.58   1.58
Mean-T      .48    .31    .09    .12        .52    .32    .08    .08
SD-T        .28    .25    .12    .12        .12    .34    .15    .15

Note: A pseudo-classical score for an alternative represents
the probability that a student would pick the
alternative if the student were forced to pick one and
only one alternative for the test item.

Note: L, U, and T mean the lower, upper, and total groups,
respectively.

Admittedly, the ideas discussed above require
detailed procedures for item administration, scoring,
and analysis; however, the additional time and effort
required can, I think, be very worthwhile for the process
of validating items.

An Application of PC1 Scores in the Classical Test
Theory Model

Recall that under the classical test theory model
X = T + E, where X, T, and E are observed, true, and
random error scores, respectively. Now, we have
described the PC1 item score for a student as a kind of
guessing-free estimate of a person's classical score,
and guessing is usually interpreted as one kind of
random error. If we assume that guessing is the only,
or the principal, kind of random error that concerns
us, then a PC1 score is a kind of true score and we
can analyze the effect of guessing upon classical scores
by using the classical test theory model directly. Thus,
in this section we will let

    X = 0 or 1 (classical observed score),

    T = PC1 item score, and

    E = random error due to guessing.

Basic statistics. Note that when one typically uses
the classical test theory model, one has observed scores,
and one wants to estimate true scores; however, in this
case, we already have the true scores, and we must estimate
the observed scores. Now, if the item were administered
to student i a total of K times we would expect
student i to get the item correct $K \cdot T_i$ times, and we would
expect student i to get the item incorrect $K \cdot (1 - T_i)$
times. Therefore, if N is the total number of subjects,
the mean and variance of the observed scores are

$$\bar{X} = \frac{1}{KN} \sum_{i=1}^{N} K \cdot T_i = \bar{T} \tag{1}$$

and

$$s_X^2 = \bar{T}(1 - \bar{T}) . \tag{2}$$

For an example of these statistics see Table 4, which
uses the synthetic data presented in Table 1 and
assumes, for the sake of illustration, that K = 12.

Table 4 also indicates the error scores associated
with each observed score for our synthetic data. The
mean and variance of the error scores are given by:



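$$
\begin{aligned}
\bar{E} &= \frac{1}{KN} \sum_{j=1}^{K} \sum_{i=1}^{N} (X_{ij} - T_i) \\
        &= \frac{1}{KN} \sum_{i=1}^{N} K \cdot T_i - \frac{1}{N} \sum_{i=1}^{N} T_i \\
        &= \bar{T} - \bar{T} \\
        &= 0
\end{aligned}
\tag{3}
$$

and

$$
\begin{aligned}
s_E^2 &= \frac{1}{KN} \sum_{j=1}^{K} \sum_{i=1}^{N} (X_{ij} - T_i)^2 \\
      &= \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{1}{K} \sum_{j=1}^{K} (X_{ij} - T_i)^2 \right] \\
      &= \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{1}{K} \sum_{j=1}^{K} X_{ij}^2
         - \frac{2}{K} \sum_{j=1}^{K} X_{ij} T_i
         + \frac{1}{K} \sum_{j=1}^{K} T_i^2 \right] \\
      &= \frac{1}{N} \sum_{i=1}^{N} \left[ T_i - \frac{2}{K} (K \cdot T_i^2) + \frac{1}{K} (K \cdot T_i^2) \right] \\
      &= \frac{1}{N} \sum_{i=1}^{N} (T_i - T_i^2) \\
      &= \frac{1}{N} \sum_{i=1}^{N} T_i (1 - T_i)
\end{aligned}
\tag{4}
$$

Now, let us demonstrate that $s_X^2 = s_T^2 + s_E^2$.

$$
\begin{aligned}
s_T^2 + s_E^2 &= \left[ \frac{1}{N} \sum_{i} T_i^2 - \bar{T}^2 \right]
                + \left[ \frac{1}{N} \sum_{i} T_i (1 - T_i) \right] \\
              &= \frac{1}{N} \sum_{i} T_i^2 - \bar{T}^2
                + \frac{1}{N} \sum_{i} T_i - \frac{1}{N} \sum_{i} T_i^2 \\
              &= \bar{T} - \bar{T}^2 \\
              &= \bar{T}(1 - \bar{T}) \\
              &= s_X^2
\end{aligned}
$$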

Thus, we have demonstrated that, by interpreting
our PC1 scores as true scores, we can express the mean and
variance of observed scores in terms of the true scores.
Furthermore, we have shown that the variance of the
observed scores does indeed equal the variance of the
true scores plus the variance of the error scores.
The mean and variance of the observed, true, and error
scores are provided in Table 4. For reference now
and later, the reader should note that, for our synthetic
data

$$\sum_{i=1}^{20} T_i = 10.41 , \qquad
  \sum_{i=1}^{20} T_i^2 = 7.9053 , \qquad \text{and} \qquad
  \sum_{i=1}^{20} T_i^3 = 6.8687 .$$
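As a numerical check on the decomposition above, the following sketch recomputes the three variances from the reported sums. The function name `item_variances` is hypothetical; the only inputs are N = 20 and the two sums quoted in the text, and the formulas are those of equations (2) and (4) together with the usual variance of the true scores.

```python
def item_variances(n, sum_t, sum_t2):
    """Return (var_x, var_t, var_e) for one item.

    n      -- number of subjects
    sum_t  -- sum of the PC1 (true) scores
    sum_t2 -- sum of the squared PC1 scores
    """
    mean_t = sum_t / n
    var_x = mean_t * (1 - mean_t)        # observed-score variance, equation (2)
    var_t = sum_t2 / n - mean_t ** 2     # true-score variance
    var_e = (sum_t - sum_t2) / n         # error variance, equation (4)
    return var_x, var_t, var_e


if __name__ == "__main__":
    var_x, var_t, var_e = item_variances(20, 10.41, 7.9053)
    print(f"s_X^2 = {var_x:.4f}")
    print(f"s_T^2 = {var_t:.4f}")
    print(f"s_E^2 = {var_e:.4f}")
    print(f"s_T^2 + s_E^2 = {var_t + var_e:.4f}")
```

The last two printed lines should agree with the first up to floating-point rounding, which is the identity the derivation establishes.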
Reliability of a one-item test. Using the above
results, we can express the reliability of a one-item
test as:

$$r_{11} = \frac{s_T^2}{s_X^2}
         = \frac{\dfrac{\sum T^2}{N} - \bar{T}^2}{\bar{T}(1 - \bar{T})}
         = \frac{\sum T^2 - N \bar{T}^2}{N \bar{T}(1 - \bar{T})} \tag{5}$$

For our synthetic data,

$$r_{11} = \frac{0.124}{0.249} = 0.498 .$$



The reader should keep in mind that $r_{11}$ is the
proportion of variance in observed scores not due to
guessing, whereas $(1 - r_{11})$ is the proportion of variance
in observed scores due to guessing. Now, we call $r_{11}$
the reliability of a one-item test; however, if there
are random errors operating other than those due to guessing,
then $r_{11}$ will be an upper limit to the "true" reliability
of the item.

In order to estimate the reliability of a test consisting
of K replications of the item, we can use the
Spearman-Brown Prophecy Formula

$$r_{KK} = \frac{K r_{11}}{1 + (K - 1) r_{11}} \tag{6}$$

Another way to view the reliability of a one-item
test is to ask how many items of a similar nature would
have to be administered in order to obtain a given level
of reliability. This question can be answered by
re-arranging the terms in the Spearman-Brown Prophecy
Formula in order to get

$$K = \frac{r_{KK}(1 - r_{11})}{r_{11}(1 - r_{KK})} , \tag{7}$$

where, in this case, $r_{KK}$ is the level of reliability
desired and K is the number of items necessary to
achieve this level of reliability. Using our synthetic
data, if we set $r_{KK} = 0.90$, then

$$K = \frac{0.90(1 - 0.498)}{0.498(1 - 0.90)} = 9.072 .$$
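A minimal sketch of the two Spearman-Brown calculations follows, using the standard form of the formula with a plus sign in the denominator. The function names are illustrative only and do not come from the paper.

```python
def spearman_brown(r11, k):
    """Reliability of a test made of k replications of an item with reliability r11."""
    return k * r11 / (1 + (k - 1) * r11)


def items_needed(r11, target):
    """Number of similar items needed to reach the target reliability."""
    return target * (1 - r11) / (r11 * (1 - target))


if __name__ == "__main__":
    k = items_needed(0.498, 0.90)
    print(f"items needed for r = 0.90: {k:.3f}")
    # Feeding k back into the prophecy formula should return the target.
    print(f"check: {spearman_brown(0.498, k):.3f}")
```

In practice K would be rounded up to the next whole number of items.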

One further statistic, of a reliability nature, may
be of interest. It can be shown that the probability that
a randomly selected student would maintain his or her
observed score on L = 2 or 3 administrations of the same
item is:

$$P_L = 1 - L \cdot s_E^2 . \tag{8}$$

For our synthetic data,

$$P_2 = 1 - 2(0.125) = 0.750$$

and

$$P_3 = 1 - 3(0.125) = 0.625 .$$
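The following sketch illustrates why the expression for this probability holds for L = 2 and L = 3. It assumes, as the model in this section does, that a student with true score T answers each administration correctly with probability T, independently; the true scores used are made-up values for illustration, not the Table 1 data.

```python
def prob_same_score_direct(true_scores, administrations):
    """Average probability that all administrations give the same 0/1 score."""
    total = 0.0
    for t in true_scores:
        # All correct (t ** L) or all incorrect ((1 - t) ** L).
        total += t ** administrations + (1 - t) ** administrations
    return total / len(true_scores)


def prob_same_score_formula(true_scores, administrations):
    """1 - L * s_E^2, where s_E^2 is the mean of T(1 - T); valid for L = 2 or 3."""
    var_e = sum(t * (1 - t) for t in true_scores) / len(true_scores)
    return 1 - administrations * var_e


if __name__ == "__main__":
    scores = [0.0, 0.25, 0.5, 1.0]   # hypothetical PC1 scores
    for L in (2, 3):
        print(L, prob_same_score_direct(scores, L), prob_same_score_formula(scores, L))
```

The two functions agree for L = 2 and L = 3 because T^2 + (1 - T)^2 = 1 - 2T(1 - T) and T^3 + (1 - T)^3 = 1 - 3T(1 - T); the shortcut does not extend to larger L.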



Regression of observed scores on true scores. The
standard error of measurement is the square root of the
expression in (4), which is also equal to

$$s_E = s_X \sqrt{1 - r_{11}} \tag{9}$$

For our synthetic data,

$$s_E = \sqrt{0.125} = 0.354$$

or

$$s_E = \sqrt{0.249} \, \sqrt{1 - 0.498} = 0.354 .$$
The reader should recall that the standard error of
measurement is associated with the regression of observed
scores on true scores, as indicated, for our synthetic
data, in Figure 1. This regression is used to predict
observed scores from true scores. As such, this regression
can be used to establish a confidence interval around
the expected difficulty level of the item, where difficulty
level is based on the classical scoring procedure
and is merely the proportion of subjects who get an item
correct.

Regression of true scores on observed scores. The
other regression of interest is the regression of true
scores on observed scores. From classical test theory,
this regression is:

$$\hat{T} = \bar{T}(1 - r_{11}) + r_{11} X \tag{10}$$

where $\hat{T}$ is the estimated value of T assuming a linear
regression of true on observed scores. The standard
deviation of errors about this regression is called the
standard error of estimate and denoted $s_{est}$. For the
kind of data considered here, it can be shown that

$$s_{est}^2 = r_{11} \left[ \frac{\sum T - \sum T^2}{N} \right]
            = \left[ \frac{\sum T - \sum T^2}{N} \right]
              \left[ \frac{N \sum T^2 - (\sum T)^2}{N \sum T - (\sum T)^2} \right] \tag{11}$$

Now, since there are only two possible observed scores
for an item (0 and 1) it is also true that

$$s_{est}^2 = w_0 \, s_{est(0)}^2 + w_1 \, s_{est(1)}^2 , \tag{12}$$

where

$s_{est(0)}^2$ = the variance of the errors about the
regression line when X = 0

$$s_{est(0)}^2 = \frac{\sum T^2 - \sum T^3}{N - \sum T}
               - \left[ \frac{\sum T - \sum T^2}{N - \sum T} \right]^2 , \tag{13}$$

$$w_0 = 1 - \bar{T} , \tag{14}$$

$s_{est(1)}^2$ = the variance of the error scores about the
regression line when X = 1

$$s_{est(1)}^2 = \frac{\sum T^3}{\sum T}
               - \left[ \frac{\sum T^2}{\sum T} \right]^2 , \text{ and} \tag{15}$$

$$w_1 = \bar{T} . \tag{16}$$

FIGURE 1
[Figure not reproduced.]



Figure 2 provides, for our synthetic data, the regressions
of true scores on observed scores, as well as the
values of the statistics indicated in (11), (13), and
(15).

FIGURE 2

Regression of True Scores on Observed Scores
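The squared standard error of estimate can be computed in two ways from the three sums reported for the synthetic data, and the sketch below checks that they agree. It assumes that the variance of errors about the regression line for X = 0 (or X = 1) is the variance of T among the expected responses scored 0 (or 1), which follows because a linear regression on a two-valued X passes through both conditional means. The function name is hypothetical.

```python
def squared_standard_error_of_estimate(n, sum_t, sum_t2, sum_t3):
    """Return (overall, weighted, var_given_0, var_given_1).

    overall  -- r11 * s_E^2
    weighted -- w0 * var_given_0 + w1 * var_given_1
    """
    mean_t = sum_t / n
    var_x = mean_t * (1 - mean_t)
    var_t = sum_t2 / n - mean_t ** 2
    var_e = (sum_t - sum_t2) / n
    r11 = var_t / var_x
    overall = r11 * var_e

    # Expected responses scored 0: student i contributes weight (1 - T_i).
    w0 = 1 - mean_t
    mean_t_given_0 = (sum_t - sum_t2) / (n - sum_t)
    var_given_0 = (sum_t2 - sum_t3) / (n - sum_t) - mean_t_given_0 ** 2

    # Expected responses scored 1: student i contributes weight T_i.
    w1 = mean_t
    mean_t_given_1 = sum_t2 / sum_t
    var_given_1 = sum_t3 / sum_t - mean_t_given_1 ** 2

    weighted = w0 * var_given_0 + w1 * var_given_1
    return overall, weighted, var_given_0, var_given_1


if __name__ == "__main__":
    overall, weighted, v0, v1 = squared_standard_error_of_estimate(
        20, 10.41, 7.9053, 6.8687
    )
    print(f"r11 * s_E^2      = {overall:.4f}")
    print(f"weighted average = {weighted:.4f}")
    print(f"variance at X=0  = {v0:.4f}")
    print(f"variance at X=1  = {v1:.4f}")
```

The sum of cubes cancels out of the weighted average, so the agreement between the two routes does not depend on the third moment; it only affects how the total is split between X = 0 and X = 1.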
Data Analysis

Design for Data Collection. In the fall of 1972 and the
spring of 1973 two forms (A and B) of a 25-item criterion-referenced
test for a course in educational measurement were
administered in both the pre- and posttest mode to 113 students.
In order to understand the design used for administering
these tests, the reader will find it useful to refer to the
format of Tables 5-7. In these tables the following notation
is used:

Factor    Level             Description

A         a1                test administered using SCoRule
A         a2                test administered using "star" technique
B         b1, b2, b3, b4    blocks of subjects
C         c1                Form A of test
C         c2                Form B of test
D         d1                Pretest
D         d2                Posttest

Also, note that a "." in place of a subscript indicates
the mean over all levels of the factor being considered.

Footnote: See Brennan (1974) for a more complete version of the
analysis of the data reported here.

Footnote: The SCoRule is a mechanical device that aids students in
assigning subjective probabilities and determining log scores.

Footnote: Factors A and B should not be confused with forms A
and B of the Pretest and the Posttest.
TABLE 5

Means and Standard Deviations: Pretest and Posttest
VAR(1) = Arithmetic Mean of Item Confidence Scores

(Cell entries are means, with standard deviations in parentheses.)

               Pretest                        Posttest
         Fm A          Fm B            Fm A          Fm B
         c1d1          c2d1            c1d2          c2d2            N

a1b1   [illegible] (.060)            [illegible] (.142)              21
a2b1   .373 (.070)                   .602 (.092)                     10
a1b2   .332 (.046)                                 .499 (.150)       19
a2b2   .342 (.037)                                 [illegible] (.116) 9
a1b3                 .330 (.049)     .535 (.109)                     17
a2b3                 [illegible] (.064) [illegible] (.119)            9
a1b4                 .322 (.066)                   .493 (.166)       20
a2b4                 .370 (.064)                   [illegible] (.141) 8
a.b.   .333 (.057)   .333 (.061)     .556 (.120)   .518 (.153)      113



TABLE 6

Means and Standard Deviations: Pretest and Posttest
VAR(2) = Arithmetic Mean of Item Pseudo-Classical Scores

(Cell entries are means, with standard deviations in parentheses.)

               Pretest                        Posttest
         Fm A          Fm B            Fm A          Fm B
         c1d1          c2d1            c1d2          c2d2            N

a1b1   .360 (.086)                   .658 (.132)                     21
a2b1   .424 (.081)                   .700 (.069)                     10
a1b2   .378 (.071)                                 .569 (.156)       19
a2b2   .403 (.046)                                 .573 (.108)        9
a1b3                 .367 (.071)     .666 (.115)                     17
a2b3                 .386 (.096)     .634 (.143)                      9
a1b4                 .357 (.078)                   .626 (.150)       20
a2b4                 .427 (.091)                   .737 (.097)        8
a.b.   .383 (.077)   .376 (.082)     .664 (.118)   .614 (.148)      113
TABLE 7

Means and Standard Deviations: Pretest and Posttest
VAR(3) = Arithmetic Mean of Classical Scores

(Cell entries are means, with standard deviations in parentheses.)

               Pretest                        Posttest
         Fm A          Fm B            Fm A          Fm B
         c1d1          c2d1            c1d2          c2d2            N

a1b1   .404 (.089)                   .691 (.118)                     21
a2b1   .464 (.076)                   .700 (.063)                     10
a1b2   .444 (.080)                                 .634 (.146)       19
a2b2   .418 (.098)                                 .578 (.104)        9
a1b3                 .419 (.108)     .678 (.122)                     17
a2b3                 .449 (.115)     .662 (.122)                      9
a1b4                 .414 (.090)                   .644 (.135)       20
a2b4                 .430 (.102)                   .760 (.117)        8
a.b.   .429 (.087)   .424 (.100)     .685 (.110)   .646 (.139)      113

The reader should note several important facts
about this design:

(a) If we collapse the levels of the A factor,
we see that subjects in the first block received
Pretest A and Posttest A, subjects in the second block
received Pretest A and Posttest B, subjects in the
third block received Pretest B and Posttest A, and
subjects in the fourth block received Pretest B and
Posttest B. Furthermore, note that subjects were
randomly assigned to blocks.

(b) The discussion above indicates that the design
is a (balanced) repeated measures design in which half
of the available cells are empty; i.e., each subject
took one form of the Pretest and one form of the Posttest,
and, thus, no subject took both forms of either the
Pretest or the Posttest. In the opinion of this author,
the constraints incorporated in the design are realistic
in that it is often not feasible to obtain repeated
measures for equivalent tests in the real world of
course development and evaluation.

(c) Although the constraint mentioned above is
realistic, it is, nevertheless, somewhat restricting.
For example, we cannot obtain direct measures of the
equivalence of the two forms of the Pre- and Posttests.
Also, when we examine summary statistics for tests and
items, these statistics sometimes will be based upon
different or partially overlapping samples of subjects.

Another important aspect of the data collection
procedure involves the way in which students responded
to test items. For each item, each student identified
the alternative he or she would pick if forced to pick
one and only one alternative; also, each student
indirectly reported his or her subjective probabilities
for each alternative for each item. Subjects in level a1
reported actual log scores (range of 0 to 100) for each
alternative using a mechanical device called a SCoRule;
these log scores were later transformed into subjective
probabilities. Students in level a2 used the twelve-point
"star" system for reporting their subjective
probabilities.

Summary Statistics for Subjects and Tests. Tables
5-7 report means and standard deviations over tests and
persons for:

VAR(1)    Arithmetic mean of item confidence scores;
          i.e., each subject's score is the arithmetic
          mean of the subjective probabilities
          associated with the correct answer to each
          item. (Range = 0 to 1.)

VAR(2)    Arithmetic mean of item pseudo-classical
          scores, which are estimated from the
          subject's subjective probabilities. (Range =
          0 to 1.)

VAR(3)    Arithmetic mean of classical item scores,
          which are determined directly from the
          "pick one" procedure. (Range = 0 to 1.)



Tables 5-7 are presented for the reader who is
interested in comparing the different types of scores
discussed above. For our purposes, in this chapter, we
will concentrate primarily upon pseudo-classical scores.
Recall that pseudo-classical scores are estimated classical
scores which are determined from the subjective
probabilities assigned by subjects to the alternatives of
test items. As indicated previously, pseudo-classical
scores are much less affected by guessing than are classical
scores, one can directly determine a kind of item
reliability from pseudo-classical item scores, and
pseudo-classical scores are easily interpreted. Pseudo-classical
scores, in fact, appear to have most of the advantages
and few of the disadvantages of both classical scores
and subjective probabilities.
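To make the comparison among the three kinds of subject-level scores concrete, here is a small sketch that computes VAR(1), VAR(2), and VAR(3) for a single subject. The data layout (twelve-point "star" allocations, the index of the correct alternative, and the index of the alternative actually picked) and all names are assumptions chosen for illustration; the response data are made up.

```python
from fractions import Fraction


def subject_scores(items):
    """Return (var1, var2, var3) for one subject.

    items -- list of (points, correct_index, picked_index) tuples, one per item
    """
    confidence, pseudo_classical, classical = [], [], []
    for points, correct, picked in items:
        total = sum(points)
        highest = max(points)
        # VAR(1): subjective probability assigned to the correct answer.
        confidence.append(Fraction(points[correct], total))
        # VAR(2): pseudo-classical (PC1) score for the item.
        if points[correct] < highest:
            pseudo_classical.append(Fraction(0))
        else:
            tied = sum(1 for p in points if p == highest)
            pseudo_classical.append(Fraction(1, tied))
        # VAR(3): classical 0/1 score from the "pick one" response.
        classical.append(Fraction(1 if picked == correct else 0))
    n = len(items)
    return sum(confidence) / n, sum(pseudo_classical) / n, sum(classical) / n


if __name__ == "__main__":
    # Hypothetical responses to four items; alternative 0 is always correct.
    responses = [
        ([12, 0, 0, 0], 0, 0),
        ([5, 5, 1, 1], 0, 1),
        ([3, 3, 3, 3], 0, 2),
        ([0, 4, 4, 4], 0, 1),
    ]
    var1, var2, var3 = subject_scores(responses)
    print(float(var1), float(var2), float(var3))
```

In this made-up example the classical score is lower than the pseudo-classical score because the subject's forced choices on the tied items happened to be wrong, which is the kind of guessing effect the pseudo-classical score is meant to average out.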

In short, in the opinion of this author, pseudo-classical
scores have considerable promise as a basis for
validating criterion-referenced, mastery, and possibly
norm-referenced test items. It should be noted that once
an item has been validated using pseudo-classical scores,
one can logically consider subsequently administering
and scoring the item using classical procedures.

In the next section we will analyze each of the 
items that make up both forms of the criterion-referenced 
Pretest and Posttest. In this section we will continue 
to emphasize pseudo-classical item scores, although we 
will, on occasion, report statistics based upon subjective 
probabilities associated with items and classical item 
scores.

Item Statistics. Let us review the nature of each
of the tests considered here. There are two forms
(A and B) of the Pretest and two forms (A and B) of the
Posttest. Pretest A and Posttest A are identical, item
by item, and the same is true of Pretest B and Posttest
B. If we let "i" be a generic item number, then item i
on Form A (in both the Pre- and Posttest) is intended to
be equivalent to item i on Form B (in both the Pre- and
Posttest). In brief, there are two different tests,
or sets of items (Form A and Form B), administered at two
different times (Pretest and Posttest). Consequently,
a complete analysis of item equivalence must consider
the issue of equivalence for each item for both the
Pretest and Posttest mode.

If we generalize from classical procedures for
testing the equivalence of two tests, we would test the
equivalence of two items in, say, the Posttest mode, by
administering both items to the same set of subjects
at the time of the Posttest. Then, if the means and
standard deviations of the two items were the same, we
could claim that the two items are equivalent, and the
correlation between the item scores for the two items
could be interpreted as a coefficient of equivalence
for the item. However, the design used to collect our
data will not permit such a procedure since, as indicated
previously, the same subjects never take both forms
of an item in either the Pretest or the Posttest mode.

In short, we cannot obtain a direct measure of item 
equivalence for the two forms of any item given the 
design for data collection employed here. However, since 
subjects were randomly assigned to blocks, and since, 
for the most part, there are no significant differences 
between block means for the Pre- and Posttests, we can 
partially consider the statistical issue of item
equivalence by examining the differences between Form A and
Form B item means and standard deviations. Tables 8 to 10
present the appropriate item statistics when items are
scored using subjective (confidence) probabilities,
classical scores, and pseudo-classical scores,
respectively.

Let us consider Table 10, which is based upon
pseudo-classical item scores, in some detail. The means
reported can be interpreted in a manner similar to
item difficulty levels. The difference between means
for the two forms of any item is tested using a t-test
for independent samples. The equivalence of item
standard deviations is tested using the FMAX statistic,
which is the ratio of the larger variance divided by the
smaller variance, and which has an F-distribution. Since
we are performing multiple tests of significance, it is
advisable to distribute the α-level (.05) equally over
all 25 items; thus, it is advisable to consider a
difference or FMAX value to be significant only if p < .002 =
.05/25.
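
The three quantities just described (the t statistic for independent samples, the FMAX ratio, and the per-item significance level) can be sketched as follows. This is a minimal illustration, not the computation actually used for Table 10: the function names and the score vectors are hypothetical, the pooled-variance form of the t statistic is an assumption, and the p-values themselves (which require the t and F distributions with the appropriate degrees of freedom) are not computed here.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """t statistic for two independent samples (pooled-variance form, assumed)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))

def fmax(x, y):
    """Ratio of the larger sample variance to the smaller one.

    Undefined (division by zero) if the smaller variance is 0.
    """
    vx, vy = variance(x), variance(y)
    return max(vx, vy) / min(vx, vy)

def per_test_alpha(alpha, n_tests):
    """Overall alpha level distributed equally over all tests."""
    return alpha / n_tests

# Hypothetical scores on one item for two independent groups (illustrative only).
form_a = [1.0, 0.5, 1.0, 0.0, 1.0, 0.5]
form_b = [0.5, 0.0, 0.5, 0.0, 1.0, 0.0]

t_value = pooled_t(form_a, form_b)
f_value = fmax(form_a, form_b)
alpha_per_item = per_test_alpha(0.05, 25)  # .05/25 = .002
```

Under this scheme a difference would be flagged only when its p-value falls below `alpha_per_item`.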

In addition to comparing means and standard
deviations for the two forms of any item, when we use
pseudo-classical scores, we can also compare the item
reliabilities discussed previously. These reliabilities
are provided in Table 11.

We can summarize the critical information in
Tables 10 and 11 in the following manner.



Item    Pretest Differences in:    Posttest Differences in:

No.     Mn's    SD's    r's        Mn's    SD's    r's

2 X X 

3 X 
7 X 

9 X 

11 XXX 

13 X 

14 X 

15 X 

21 XX 

22 X X

23 X X 

24 X X 



In the above table, an "x" appears only if p<.002, and
the items listed are only those for which at least one 
pretest or posttest difference is significant at p<.002. 
Clearly there is some evidence that the two forms of 
some items are not equivalent, for either the pretest 
mode or the posttest mode or both modes. Note that if 
two items are equivalent when administered in the pretest 
mode, this does not guarantee that the items will be 
equivalent when administered in the posttest mode, and 
vice versa.

Table 12 lists the item means (for the two forms
of the pre- and posttest) in a format somewhat different
from that used in Tables 8-10. Using Table 12 the reader
can readily examine the magnitude and direction of the
differences among average confidence probabilities,
pseudo-classical, and classical scores for each item
in each form of each test. It is especially instructive
to examine the differences between pseudo-classical and
classical scores. Roughly speaking, these differences
are greater for the pretest than for the posttest; this
observation coincides with the fact that pretest item
reliabilities are generally lower than posttest item
reliabilities.

Table 13 presents correlation coefficients based
on the data in Table 12. At least three observations can 
be made from Table 13: 

(a) The correlation between confidence probabilities
and classical scores is consistently less than the 
correlation between pseudo-classical scores and classical 
scores; 

(b) The correlation between confidence probabilities 
and classical scores is consistently less than the corre- 
lation between confidence probabilities and pseudo-classical 
scores; and 

(c) Pretest correlations are consistently less than 
posttest correlations. This is especially true for the 
correlations between pseudo-classical and classical scores.
This latter observation is to be expected since pretest
reliabilities are consistently less than posttest reliabilities.
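
Correlations of the kind summarized above can be obtained by correlating, over items, the item means for each pair of score types. The sketch below assumes that these are ordinary product-moment (Pearson) correlations, which the text does not state explicitly; the function name and the item means are hypothetical and are not the values in Table 12.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical item means for one form of one test, one value per item.
item_means = {
    "VAR(1)": [0.40, 0.55, 0.70, 0.35, 0.80],  # confidence probabilities
    "VAR(2)": [0.30, 0.50, 0.75, 0.20, 0.90],  # pseudo-classical scores
    "VAR(3)": [0.50, 0.60, 0.80, 0.45, 0.85],  # classical scores
}

# One correlation for each of the three pairs of score types.
names = list(item_means)
correlations = {
    (a, b): pearson_r(item_means[a], item_means[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
```

Repeating this for each form of each test would give the four triangular blocks of a table laid out like Table 13.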

Summary 

This paper should be interpreted as a tentative 
attempt to explore the use of subjective probabilities 
(such as those collected when one administers a test in 
the confidence testing manner) in the analysis of item 
data, especially criterion-referenced item data.

There are two important assumptions implicit
in this paper: (a) one wants to obtain a maximum amount
of information with respect to an item using a minimum
number of subjects and (b) once the item is validated
it may well be administered in the classical correct/
incorrect manner. It appears to this author that one way
to satisfy these assumptions is to initially administer
the unvalidated item to a small number of subjects (say,
20-25) using confidence testing procedures. Then one
can translate the subjective probabilities to pseudo-
classical item scores which are, at least theoretically,
guessing-free. Using pseudo-classical scores, one can
construct a relatively sophisticated item analysis table,
calculate typical item statistics, and, in addition,
assess a kind of item reliability independent of total
test reliability.

Pseudo-classical scores are, in fact, estimates of 
classical scores; therefore, once the item is validated 
using pseudo-classical scores, one can subsequently 
administer and score items in the classical manner.
Thus, pseudo-classical scores provide, I think, a useful
bridge between subjective probabilities and classical
correct/incorrect scores. As such, pseudo-classical scores 
appear to be of potential use in the analysis of criterion-
referenced items.

TABLE 13

CORRELATIONS AMONG THREE DIFFERENT ITEM SCORES

             Pretest Form A        Pretest Form B
             VAR(1)   VAR(2)       VAR(1)   VAR(2)

VAR(1)
VAR(2)        .724                  .853
VAR(3)        .460     .546         .498     .668

             Posttest Form A       Posttest Form B
             VAR(1)   VAR(2)       VAR(1)   VAR(2)

VAR(1)
VAR(2)        .849                  .865
VAR(3)        .772     .908         .857     .939

NOTE: VAR(1) = arithmetic mean of confidence probabilities;
      VAR(2) = arithmetic mean of pseudo-classical scores;
      VAR(3) = arithmetic mean of classical scores.